<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>syslog parsing in  rsyslog</title>
</head>
<body>
<h1>syslog parsing in rsyslog</h1>
<p><small><i>Written by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a>
(2008-09-23)</i></small></p>
<p><b>We regularly receive messages asking why <a href="http://www.rsyslog.com">rsyslog</a>
parses this or that message incorrectly.</b> Of course, it turns out that rsyslog does
the right thing, but the message sender does not. And also of course, this is not even
of the slightest help to the end user experiencing the problem ;). So I thought I write this
paper. It describes the problem source and shows potential solutions (aha!).
<h2>Syslog Standardization</h2>
The syslog protocol has not been standardized until relatively recently.The first document "smelling" a bit
like a standard is <a href="http://www.ietf.org/rfc/rfc3164.txt">RFC 3164</a>, which dates back
to August 2001. The problem is that this document is no real standard. It has assigned "informational"
status by the <a href="http://www.ietf.org">IETF</a> which means it provides some hopefully
useful information but does not demand anything. It is impossible to "comply" to an informational
document. This, of course, doesn't stop marketing guys from telling they comply to RFC3164 and
it also does not stop some techs to tell you "this and that does not comply to RFC3164, so it is
&lt;anybody else but them&gt;'s fault".
<p>Then, there is <a href="http://www.ietf.org/rfc/rfc3195.txt">RFC3195</a>, which is 
a real standard. In it's section 3 it makes (a somewhat questionable) reference to (informational)
RFC 3164 which may be interpreted in a way that RFC3195 standardizes the format layed out
in RFC 3164 by virtue of referencing them. So RFC3195 seems to extend its standardization
domain to the concepts layed out in RFC 3164 (which is why I tend to find that refrence
questionable). In that sense, RFC3195 standardizes the format informationally described in
RFC3164, Section 4. But it demands it only for the scope of RFC3195, which is syslog over
BEEP - and NOT syslog over UDP. So one may argue whether or not the RFC3164 format could
be considered a standard for any non-BEEP (including UDP) syslog, too. In the strict view
I tend to have, it does not. Refering to the RFC3195 context usually does not help,
because there are virtually no RFC3195 implementations available (at this time, 
I would consider this RFC a failure).
<p>Now let's for a short moment assume that RFC3195 would somehow be able to demand
RFC3164 format for non-BEEP syslog. So we could use RFC3164 format as a standard. But does
that really help? Let's cite RFC 3164, right at the begining of section 4 (actually, this
is the first sentence):
<blockquote>
<pre>
   The payload of any IP packet that has a UDP destination port of 514
   MUST be treated as a syslog message. 
<pre>
</blockquote>
<p>Think a bit about it: this means that whatever is send to port 514 must be considered
a valid syslog message. No format at all is demanded. So if "this is junk" is sent to
UDP port 514 - voila, we have a valid message (interestingly, it is no longer a syslog
message if it is sent to port 515 ;)). You may now argue that I am overdoing. So let's
cite RFC 3164, Section 5.4, Example 2:
<blockquote>
<pre>
  Example 2

        Use the BFG!

   While this is a valid message, it has extraordinarily little useful
   information.
</pre>
</blockquote>
<p>As you can see, RFC3164 explicitely states that no format at all is required.
<p>Now a side-note is due: all of this does not mean that the RFC3164 authors
did not know what they were doing. No, right the contrary is true: RFC3164 mission
is to describe what has been seen in practice as syslog messages and the 
conclusion is quite right that there is no common understanding on the
message format. This is also the reason why RFC3164 is an informational document:
it provides useful information, but does not precisely specify anything.
<p>After all of this bashing, I now have to admit that RFC3164 has some format
recommendations layed out in section 4. The format described has quite some
value in it and implementors recently try to follow it. This format is usually meant
when someone tells you that a software is "RFC3164 compliant" or expects "RFC3164 compliant messages".
I also have to admit that rsyslog also uses this format and, in the sense outlined here,
expects messages received to be "RFC3164 compliant" (knowingly that such a beast does not
exist - I am simply lying here ;)).
<p>Please note that there is some relief of the situation in reach. There is a new normative
syslog RFC series upcoming, and it specifies a standard message format. At the time of
this writing, the main documents are sitting in the RFC editor queue waiting for a transport
mapping to be completed. I personally expect them to be assigned RFC numbers in 2009.
<h2>Practical Format Requirements</h2>
<p>From a practical point of view, the message format expected (and generated by 
default in legacy mode) is:
<pre><code>
&lt;PRI&gt;TIMESTAMP SP HOST SP TAG MSG(Freetext)
</code></pre>
<p>SP is the ASCII "space" character and the definition of the rest of the fields
can be taken from RFC3164. Please note that there also is a lot of confusion on what
syntax and semantics the TAG actually has. This format is called "legacy syslog" because
it is not well specified (as you know by now) and has been "inherited from the real world".
<p>Rsyslog offers two parsers: one for the upcoming RFC series and one for legacy format. We 
concentrate on the later. That parser applies some logic to detect missing hostnames,
is able to handle various ways the TIMESTAMP is typically malformed. In short it applies
a lot of guesswork in trying to figure out what a message really means. I am sure the
guessing algorithm can be improved, and I am always trying that when I see new malformed
messages (and there is an ample set of them...). However, this finds its limits where
it is not possible to differentiate between two entities which could be either.
For example, look at this message:
<pre><code>
&lt;144&gt;Tue Sep 23 11:40:01 taghost sample message
</code></pre>
<p>Does it contain a hostname? Mabye. The value "taghost" is a valid hostname. Of course, it is
also a valid tag. If it is a hostname, the tag's value is "sample" and the msg value is "message".
Or is the hostname missing, the tag is "taghost" and msg is "sample message"? As a human, I tend
to say the later interpretation is correct. But that's hard to tell the message parser (and, no, I do
not intend to apply artificial intelligence just to guess what the hostname value is...).
<p>One approach is to configure the parser so that it never expects hostnames. This becomes problematic
if you receive messages from multiple devices. Over time, I may implement parser conditionals,
but this is not yet available and I am not really sure if it is needed comlexity...
<p>Things like this, happen. Even more scary formats happen in practice. Even from mainstream
vendors. For example, I was just asked about this message (which, btw, finally made me
write this article here):
<pre></code>
"<130> [ERROR] iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
</code></pre>
<p>If you compare it with the format RFC3164 "suggests", you'll quickly notice that
the message is "a bit" malformed. Actually, even my human intelligence is not sufficient
to guess if there is a TAG or not (is "[ERROR]" a tag or part of the message). I may not be
the smartest guy, but don't expect me to program a parser that is smarter than me.
<p>To the best of my konwledge, these vendor's device's syslog format can be configured, so it
would proabably be a good idea to include a (sufficiently well-formed) timestamp,
the sending hostname and (maybe?) a tag to make this message well parseable.
I will also once again take this sample and see if we can apply some guesswork.
For example, "[" can not be part of a well-formed TIMESTAMP, so logic can conclude
there is not TIMESTAMP. Also, "[" can not be used inside a valid hostname, so
logic can conclude that the message contains no hostname. Even if I implement this
logic (which I will probably do), this is a partial solution: it is impossible to
guess if there is a tag or not (honestly!). And, even worse, it is a solution only for
those set of messages that can be handled by the logic described. Now consider this 
hypothetical message:
<pre></code>
"<130> [ERROR] host.example.net 2008-09-23 11-40-22 PST iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
</code></pre>
<p>Obviously, it requires additional guesswork. If we iterate over all the cases, we
can very quickly see that it is impossible to guess everything correct. In the example above
we can not even surely tell if PST should be a timezone or some other message property.
<p>A potential solution is to generate a parser-table based parser, but this requires
considerable effort and also has quite some runtime overhead. I try to avoid this for
now (but I may do it, especially if someone sponsors this work ;)). Side-note: if you want
to be a bit scared about potential formats, you may want to have a look at my paper
<i>&quot;<a href="http://www.monitorware.com/en/workinprogress/nature-of-syslog-data.php">On the Nature of Syslog Data</a>&quot;</i>.
<h2>Work-Around</h2>
<p><b>The number one work-around is to configure your devices so that they emit
(sufficiently) well-formed messages.</b> You should by now know what these look
like.
<p>If that cure is not available, there are some things you can do in rsyslog to
handle the situation. First of all, be sure to read about
<a href="rsyslog_conf.html">rsyslog.conf format</a>
and the <a href="property_replacer.html">property replacer and properties</a> specifically.
You need to understand that everything is configured in rsyslog. And that the message is parsed
into properties. There are also properties available which do not stem back directly to parsing.
Most importantly, %fromhost% property holds the name of the system rsyslog received
the message from. In non-relay cases, this can be used instead of hostname. In relay cases,
there is no cure other than to either fix the orginal sender or at least one of the
relays in front of the rsyslog instance in question. Similarly, you can use %timegenerated%
instead of %timereported%. Timegenerated is the time the message hit rsyslog for the first
time. For non-relayed, locally connected peers, Timegenerated should be a very close approximation
of the actual time a message was formed at the sender (depending, of course, on potential
internal queueing inside the sender).
Also, you may use the 
%rawmsg% property together with the several extraction modes the property replacer supports.
Rawmsg contains the message as it is received from the remote peer. In a sense, you can
implement a post-parser with this method.
<p>To use these properties, you need to define your own templates and assign them. Details
can be found in the above-quoted documentation. Just let's do a quick example. Let's say
you have the horrible message shown above and can not fix the sending device for
some good reason. In rsyslog.conf, you used to say:
<pre><code>
*.* /var/log/somefile
</code></pre>
<p>Of course, things do not work out well with that ill-formed message. So you decide
to dump the rawmsg to the file and pull the remote host and time of message generation
from rsyslog's internal properties (which, btw, is clever, because otherwise there is no
indication of these two properties...). So you need to define a template for that and
make sure the template is used with your file logging action. This is how it may look:
<pre><code>
$template, MalfromedMsgFormater,"%timegenerated% %fromhost% %rawmsg:::drop-last-lf%\n"
*.* /var/log/somefile;MalformedMsgFormatter
</code></pre>
<p>This will make your log much nicer, but not look perfect. Experiment a bit
with the available properties and replacer extraction options to fine-tune it
to your needs.
<h2>Wrap-Up</h2>
<p>Syslog message format is not sufficiently standardized. There exists a weak
"standard" format, which is used by a good number of implementations. However, there
exist many others, including mainstream vendor implementations, which have a 
(sometimes horribly) different format. Rsyslog tries to deal with anomalies but
can not guess right in all instances. If possible, the sender should be configured
to submit well-formed messages. If that is not possible, you can work around these
issues with rsyslog's property replacer and template system.
<p>I hope this is a useful guide. You may also have a look at the 
<a href="troubleshoot.html">rsyslog troubleshooting guide</a> for further help and places where
to ask questions.
<p>[<a href="manual.html">manual index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
<p><font size="2">This documentation is part of the
<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
Copyright &copy; 2008 by <a href="http://www.gerhards.net/rainer">Rainer
Gerhards</a> and <a href="http://www.adiscon.com/">Adiscon</a>.
Released under the GNU GPL version 3 or higher.</font></p>
</body></html>
